Smart Paper Notes

CAFE: Catastrophic Data Leakage in Vertical Federated Learning

Xiao Jin, Pin-Yu Chen, Chia-Yi Hsu, Chia-Mu Yu, Tianyi Chen

Categories: Machine Learning | Artificial Intelligence

2021-10-26

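Recent studies show that private training data can be leaked through distributed machine learning systems such as federated learning (FL). Increasing the batch size to complicate data recovery is often viewed as a promising defense against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack, with theoretical justification, that efficiently recovers batch data from the shared aggregated gradients. We name the proposed method catastrophic data leakage in vertical federated learning (CAFE). Compared with existing data leakage attacks, our extensive experiments in vertical FL settings demonstrate the effectiveness of CAFE in improving data recovery quality. We also propose a practical countermeasure to mitigate CAFE. Our results suggest that private data participating in standard FL, especially in the vertical case, is at high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in these learning settings. The code for our work is available at https://github.com/derafael/cafe.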
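
As a rough illustration of why shared gradients can leak training data, and why a larger batch is commonly regarded as a defense, the sketch below uses a single linear unit with squared loss. It is not the CAFE attack; the model, the loss, and all function names are assumptions made for this example only.

```python
def linear_gradients(w, b, x, y):
    """Gradients of 0.5 * (w.x + b - y)^2 with respect to w and b."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) + b - y
    grad_w = [residual * xi for xi in x]  # dL/dw = residual * x
    grad_b = residual                     # dL/db = residual
    return grad_w, grad_b


def recover_input(grad_w, grad_b):
    """Recover x from single-sample gradients, using grad_w = grad_b * x."""
    if grad_b == 0:
        raise ValueError("zero residual: the gradients carry no information about x")
    return [g / grad_b for g in grad_w]


def batch_gradients(w, b, xs, ys):
    """Mean of the per-sample gradients, as shared when a batch is aggregated."""
    n = len(xs)
    sum_w = [0.0] * len(w)
    sum_b = 0.0
    for x, y in zip(xs, ys):
        grad_w, grad_b = linear_gradients(w, b, x, y)
        sum_w = [s + g for s, g in zip(sum_w, grad_w)]
        sum_b += grad_b
    return [s / n for s in sum_w], sum_b / n
```

With one sample, `recover_input(*linear_gradients(w, b, x, y))` returns `x` up to floating-point error. Applied to the output of `batch_gradients`, the same division only yields a residual-weighted mixture of the samples, which is the intuition behind the batch-size defense that the abstract revisits.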

translated by Google Translate

Formalizing Generalization and Robustness of Neural Networks to Weight Perturbations

Yu-Lin Tsai, Chia-Yi Hsu, Chia-Mu Yu, Pin-Yu Chen

Categories: Machine Learning

2021-03-03

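Studying the sensitivity of neural networks to weight perturbations and its effect on model performance, including generalization and robustness, is an active research topic because of its implications for a wide range of machine learning tasks such as model compression, generalization gap assessment, and adversarial attacks. In this paper, we provide the first integral study and analysis of the robustness of feed-forward neural networks and their generalization behavior under weight perturbations. We further design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations. Empirical experiments are conducted to validate our theoretical analysis. Our results offer fundamental insights for characterizing the generalization and robustness of neural networks against weight perturbations.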

Adversarial Examples can be Effective Data Augmentation for Unsupervised Machine Learning

Chia-Yi Hsu, Pin-Yu Chen, Songtao Lu, Sijia Liu, Chia-Mu Yu

Categories: Machine Learning | Computer Vision

2021-03-02

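Adversarial examples that cause evasive predictions are widely used to evaluate and improve the robustness of machine learning models. However, current studies focus on supervised learning tasks and rely on ground-truth data labels, a targeted objective, or supervision from a trained classifier. In this paper, we propose a framework for generating adversarial examples for unsupervised models and demonstrate novel applications to data augmentation. Our framework uses a mutual information neural estimator as an information-theoretic similarity measure to generate unsupervised adversarial examples. We propose a new MinMax algorithm with provable convergence guarantees for efficiently generating unsupervised adversarial examples. Our framework can also be extended to supervised adversarial examples. When unsupervised adversarial examples are used as a simple plug-in data augmentation tool for model retraining, significant improvements are consistently observed across different unsupervised tasks and datasets, including data reconstruction, representation learning, and contrastive learning. Our results show novel methods and considerable advantages in studying and improving unsupervised machine learning via adversarial examples.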

Trajectory Smoothing Using GNSS/PDR Integration Via Factor Graph Optimization in Urban Canyons

Yihan Zhong, Weisong Wen, Li-Ta Hsu

Categories: Robotics

2022-12-29

Accurate and smooth global navigation satellite system (GNSS) positioning for pedestrians in urban canyons is still a challenge due to the multipath effects and the non-line-of-sight (NLOS) receptions caused by reflections from surrounding buildings. The recently developed factor graph optimization (FGO) based GNSS positioning method opened a new window for improving urban GNSS positioning by effectively exploiting the measurement redundancy from historical information to resist outlier measurements. Unfortunately, FGO-based GNSS standalone positioning is still challenged in highly urbanized areas. As an extension of the previous FGO-based GNSS positioning method, this paper exploits the potential of the pedestrian dead reckoning (PDR) model in FGO to improve GNSS standalone positioning performance in urban canyons. Specifically, the relative motion of the pedestrian is estimated from the raw acceleration measurements of the onboard smartphone inertial measurement unit (IMU) via the PDR algorithm. Then the raw GNSS pseudorange, Doppler measurements, and relative motion from PDR are integrated using FGO. Given that pedestrian navigation involves small accelerations most of the time, a novel soft motion model is proposed to smooth the states involved in the factor graph model. The effectiveness of the proposed method is verified step by step on two datasets collected in dense urban canyons of Hong Kong using smartphone-level GNSS receivers. A comparison between the conventional extended Kalman filter, several existing methods, and the FGO-based integration is presented. The results reveal that the existing FGO-based GNSS standalone positioning is highly complementary to the PDR's relative motion estimation. Both improved positioning accuracy and trajectory smoothness are obtained with the help of the proposed method.
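
To make the PDR step more concrete, here is a minimal sketch of generic step-and-heading dead reckoning. It is not the paper's algorithm: the threshold-crossing step detector, the fixed step length, the heading convention (radians, clockwise from north), and all names are assumptions chosen for illustration.

```python
import math


def count_steps(accel_magnitudes, threshold=11.0):
    """Count upward crossings of `threshold` (m/s^2) as steps (assumed detector)."""
    steps = 0
    above = False
    for a in accel_magnitudes:
        if a > threshold and not above:
            steps += 1
        above = a > threshold
    return steps


def dead_reckon(start, step_length, headings_rad):
    """Accumulate one fixed-length step per heading.

    Headings are assumed to be in radians, measured clockwise from north.
    Returns the list of (east, north) positions, including the start.
    """
    east, north = start
    track = [(east, north)]
    for heading in headings_rad:
        east += step_length * math.sin(heading)
        north += step_length * math.cos(heading)
        track.append((east, north))
    return track
```

For example, `dead_reckon((0.0, 0.0), 0.7, [0.0, math.pi / 2])` moves one step north and then one step east. Displacements between consecutive positions of this kind are the sort of relative-motion quantity that the abstract describes integrating with GNSS pseudorange and Doppler measurements.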

Cross-Resolution Flow Propagation for Foveated Video Super-Resolution

Eugene Lee, Lien-Feng Hsu, Evan Chen, Chen-Yi Lee

Categories: Computer Vision | Artificial Intelligence

2022-12-27

The demand for high-resolution video content has grown over the years. However, the delivery of high-resolution video is constrained by either the computational resources required for rendering or the network bandwidth for remote transmission. To remedy this limitation, we leverage the eye trackers found alongside existing augmented and virtual reality headsets. We propose applying a video super-resolution (VSR) technique to fuse low-resolution context with regional high-resolution context, enabling resource-constrained consumption of high-resolution content without a perceivable drop in quality. Eye trackers provide us with the gaze direction of a user, aiding us in the extraction of the regional high-resolution context. As only pixels that fall within the gaze region can be resolved by the human eye, a large amount of the delivered content is redundant, since we cannot perceive the difference in quality beyond the observed region. To generate a visually pleasing frame from the fusion of the high-resolution region and the low-resolution region, we study the capability of a deep neural network to transfer the context of the observed region to other (low-resolution) regions of the current and future frames. We label this task Foveated Video Super-Resolution (FVSR), as we need to super-resolve the low-resolution regions of current and future frames through the fusion of pixels from the gaze region. We propose Cross-Resolution Flow Propagation (CRFP) for FVSR. We train and evaluate CRFP on the REDS dataset on the task of 8x FVSR, i.e., a combination of 8x VSR and the fusion of the foveated region. Departing from the conventional evaluation of per-frame quality using SSIM or PSNR, we propose the evaluation of the past foveated region, measuring the capability of a model to leverage the noise present in eye trackers during FVSR. Code is made available at https://github.com/eugenelet/CRFP.
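
For reference, the conventional per-frame PSNR mentioned above, and a version restricted to a rectangular region, can be sketched as follows. This is not the paper's proposed metric; the rectangular `(top, left, height, width)` region format and the function name are assumptions for illustration.

```python
import math


def psnr(reference, estimate, region=None, max_value=255.0):
    """PSNR in dB over a whole grayscale frame or a rectangular region.

    `reference` and `estimate` are equally sized lists of pixel rows.
    `region` is an assumed (top, left, height, width) tuple; None means the full frame.
    """
    if region is None:
        top, left, height, width = 0, 0, len(reference), len(reference[0])
    else:
        top, left, height, width = region
    squared_error = 0.0
    for r in range(top, top + height):
        for c in range(left, left + width):
            diff = reference[r][c] - estimate[r][c]
            squared_error += diff * diff
    mse = squared_error / (height * width)
    if mse == 0:
        return math.inf  # identical pixels: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Calling `psnr(ref, est)` scores the full frame, while `psnr(ref, est, region=(top, left, h, w))` scores only the given rectangle, for instance a previously gazed-at area.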

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Categories: Computer Vision

2022-12-22

To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem, One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos across various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.

ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement

Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi

Categories: Computer Vision | Machine Learning

2022-12-21

Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.

Multi-hop Evidence Retrieval for Cross-document Relation Extraction

Keming Lu, I-Hung Hsu, Wenxuan Zhou, Mingyu Derek Ma, Muhao Chen

Categories: Natural Language Processing | Machine Learning

2022-12-21

Relation Extraction (RE) has been extended to cross-document scenarios because many relations are not simply described in a single document. This inevitably brings the challenge of efficient open-space evidence retrieval to support the inference of cross-document relations, along with the challenge of multi-hop reasoning on top of entities and evidence scattered in an open set of documents. To combat these challenges, we propose Mr.CoD, a multi-hop evidence retrieval method based on evidence path mining and ranking with adapted dense retrievers. We explore multiple variants of retrievers to show that evidence retrieval is an essential part of cross-document RE. Experiments on CodRED show that evidence retrieval with Mr.CoD effectively acquires cross-document evidence that essentially supports open-setting cross-document RE. Additionally, we show that Mr.CoD facilitates evidence retrieval and boosts end-to-end RE performance with effective multi-hop reasoning in both closed and open settings of RE.

Free-form 3D Scene Inpainting with Dual-stream GAN

Ru-Fen Jheng, Tsung-Han Wu, Jia-Fong Yeh, Winston H. Hsu

Categories: Computer Vision

2022-12-16

Nowadays, the need for user editing in a 3D scene has rapidly increased due to the development of AR and VR technology. However, the existing 3D scene completion task (and datasets) cannot suit the need because the missing regions in scenes are generated by the sensor limitation or object occlusion. Thus, we present a novel task named free-form 3D scene inpainting. Unlike scenes in previous 3D completion datasets preserving most of the main structures and hints of detailed shapes around missing regions, the proposed inpainting dataset, FF-Matterport, contains large and diverse missing regions formed by our free-form 3D mask generation algorithm that can mimic human drawing trajectories in 3D space. Moreover, prior 3D completion methods cannot perform well on this challenging yet practical task, simply interpolating nearby geometry and color context. Thus, a tailored dual-stream GAN method is proposed. First, our dual-stream generator, fusing both geometry and color information, produces distinct semantic boundaries and solves the interpolation issue. To further enhance the details, our lightweight dual-stream discriminator regularizes the geometry and color edges of the predicted scenes to be realistic and sharp. We conducted experiments with the proposed FF-Matterport dataset. Qualitative and quantitative results validate the superiority of our approach over existing scene completion methods and the efficacy of all proposed components.

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

Categories: Machine Learning | Natural Language Processing

2022-12-14

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder, and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
